Abstract



There is a growing interest in optimizations that depend on or benefit from 
an execution profile that tells where time is spent. How well does a profile 
from one run describe the behavior of a different run, and how does this 
compare with the behavior predicted statically by examining the program it- 
self? This paper defines two abstract measures of how well a profile predicts 
actual behavior. According to these measures, real profiles indeed do better 
than estimated profiles, usually. A perfect profile from an earlier run with 
the same data set, however, does better still, sometimes by a factor of two. 
Using such a profile is unrealistic, and can lead to inflated expectations of a 
profile-driven optimization. 



1. Introduction 

Many people have built or speculated on systems that use a run-time profile to 
guide code optimization. Applications include the selection of variables to promote to 
registers [7,8], placement of code sequences to improve cache behavior [3,6], and pred- 
iction of common control paths for optimizations across basic block boundaries [2,5]. 

When such work is presented, two questions are often asked but seldom ade- 
quately answered. How well does a profile from one run predict the behavior of 
another? And how well can you do with an estimated profile derived from static 
analysis of the program? It is important to answer these questions in general terms as 
well as specific. A profile from a different run may be very useful for one kind of 
optimization but nearly useless for another kind. The optimization may require identi- 
fying the specific program entities that are most used, or it may require only identifying 
some that are used a lot. 

This paper describes a study of how well an estimated profile predicts real 
behavior, and how well a profile from one run predicts the behavior of another run. 

2. Methodology

The pixie tool from Mips [4] instruments an executable file with basic block 
counting; when the instrumented program is run, it produces a table telling how many 
times each basic block was executed. From this table, in combination with static infor- 
mation from the executable file, we derive four kinds of profiles. The first is the basic 
block profile, which is just the mapping from each basic block to its execution count. 
The second is the procedure profile, which maps each procedure to the number of 
times it is entered. The third is the call profile, which maps each distinct call site to 
the number of times it is executed. The last is the global variable profile, which maps 
each global variable to the number of times it is directly referenced. 
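
As an illustration, the derivation of the other three profiles from the basic block counts might be sketched as follows. The data layout (`entry_block`, `call_sites`, `global_refs`) is a hypothetical stand-in for the static information read from the executable, not the format used in the study. The sketch also assumes that a procedure's entry block is executed exactly once per invocation.

```python
from collections import Counter

def derive_profiles(block_counts, entry_block, call_sites, global_refs):
    """Derive procedure, call, and global variable profiles from block counts.

    block_counts: block id -> execution count (the basic block profile)
    entry_block:  procedure name -> id of its entry block (assumed layout)
    call_sites:   call site id -> id of the block containing the call
    global_refs:  block id -> list of globals referenced directly in the block,
                  one list element per referencing instruction
    """
    # A procedure is entered as often as its entry block runs (assumption).
    procedure_profile = {p: block_counts.get(b, 0) for p, b in entry_block.items()}
    # A call site executes as often as the block that contains it.
    call_profile = {s: block_counts.get(b, 0) for s, b in call_sites.items()}
    # Each direct reference executes as often as its enclosing block.
    global_profile = Counter()
    for block, names in global_refs.items():
        for name in names:
            global_profile[name] += block_counts.get(block, 0)
    return procedure_profile, call_profile, dict(global_profile)
```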

If we don't have basic block counts from pixie, we can try to estimate them. We
first divide the program into basic blocks, and connect them into procedures and flow
graphs based on the branch structure.* We then identify the loops by computing the
dominator relation and finding the back edges, that is, edges whose target dominates
their source. A loop consists of the set of back edges leading to a single dominator,
together with the edges that appear on any path from the dominator to the source of
one of the back edges [1]. We also build a static call graph by finding all the direct
calls in the program; this graph will not include calls through procedure variables.

* The Mips code generation is stylized enough that we can recognize indirect jumps that
represent case statements, and can deduce what the possible successor blocks are.
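
The loop-finding step can be sketched in Python as follows. This is a textbook-style iterative dominator computation, not necessarily the one used in the study. An edge is treated as a back edge when its target block dominates its source block. The flow graph is assumed to be a dictionary `succ` mapping each block to its successors, with every block reachable from `entry`. For simplicity a loop is returned as a set of blocks rather than a set of edges.

```python
def dominators(succ, entry):
    """Return dom, where dom[n] is the set of blocks that dominate n."""
    nodes = set(succ) | {s for ss in succ.values() for s in ss}
    pred = {n: set() for n in nodes}
    for n, ss in succ.items():
        for s in ss:
            pred[s].add(n)
    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            preds = [dom[p] for p in pred[n]]
            new = {n} | (set.intersection(*preds) if preds else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

def back_edges(succ, entry):
    """Return the edges (source, target) whose target dominates their source."""
    dom = dominators(succ, entry)
    return [(src, dst) for src, dsts in succ.items() for dst in dsts
            if dst in dom[src]]

def loops(succ, entry):
    """Map each loop header to the set of blocks in its loop."""
    pred = {}
    for n, ss in succ.items():
        for s in ss:
            pred.setdefault(s, set()).add(n)
    result = {}
    for src, header in back_edges(succ, entry):
        # Back edges sharing a header are merged into one loop.
        body = result.setdefault(header, {header})
        stack = [src]
        while stack:
            n = stack.pop()
            if n not in body:
                body.add(n)
                stack.extend(pred.get(n, ()))
    return result
```

The number of loops containing a block, which the estimates need, is then the number of headers whose block set includes it.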

Given this information, we considered four different ways of estimating basic
block counts. The first is the loop-only estimate, in which a block's count is initially
1 and is multiplied by 3 for each loop that contains it; this ignores the effects of the call
graph. The second is the leaf-loop estimate, in which the loop-only count is multiplied
by 1024 if the block is contained in a leaf procedure, 512 if it is no more than one call
away from a leaf procedure, and so on with powers of 2 down to 1. The third is the
call-loop estimate, in which the loop-only count is multiplied by the static number of
direct calls of the block's procedure. The fourth is the call+1-loop estimate, in which
the loop-only count is multiplied by one more than the static number of direct calls of
the block's procedure. The call+1-loop estimate is like the call-loop estimate, except
that procedures that are called only indirectly will not be shut out altogether;
unfortunately, procedures that are never called are similarly readmitted.
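
The four estimates reduce to a few lines of arithmetic. The sketch below is one reading of the definitions above; the function name and parameters are illustrative, and the inputs (loop depth, distance from a leaf, static call count) are assumed to have been computed already from the flow graphs and the static call graph.

```python
def estimate_block_count(loop_depth, leaf_distance, direct_calls, method):
    """Estimate the execution count of one basic block.

    loop_depth:    number of loops that contain the block
    leaf_distance: call-graph distance of the block's procedure from a leaf
                   procedure (0 for a leaf procedure itself)
    direct_calls:  static number of direct calls of the block's procedure
    method:        "loop-only", "leaf-loop", "call-loop", or "call+1-loop"
    """
    loop_only = 3 ** loop_depth          # start at 1, times 3 per enclosing loop
    if method == "loop-only":
        return loop_only
    if method == "leaf-loop":
        # 1024 for a leaf, 512 one call away, halving down to a floor of 1.
        return loop_only * max(1, 1024 >> leaf_distance)
    if method == "call-loop":
        return loop_only * direct_calls
    if method == "call+1-loop":
        return loop_only * (direct_calls + 1)
    raise ValueError(f"unknown method: {method}")
```

Note that the call-loop estimate gives a count of zero to every block of a procedure with no direct calls, which is the case the variant with one added to the call count is meant to soften.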

An optimizer would use a profile by selecting the most frequent entries in it and 
doing something special to them: promoting them to registers, optimizing them extra 
hard, or whatever. The question is how well a candidate profile, real or estimated, 
predicts the behavior described by a reference profile. For this study we considered two 
abstract methods of evaluating a candidate profile. 

The first method, specific matching, is to take the top n entries of the candidate 
profile and see how many of them are also in the top n entries of the reference profile. 
For instance, consider the procedure profiles in Figure 1. If we let n = 8, we see that 
the first 8 members of the candidate profile include 5 of the first 8 members of the 
reference profile. Thus the candidate profile gets a score of 5/8, or 0.625. 

Figure 1. Candidate profile (left) and reference profile.
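
A minimal sketch of specific matching, assuming a profile is represented as a dictionary from entity name to count. The sample data is made up for illustration and is not the data of Figure 1. Ties are broken by dictionary order, which is an arbitrary choice of this sketch.

```python
def top_n(profile, n):
    """Return the n entries of a profile with the highest counts."""
    return sorted(profile, key=profile.get, reverse=True)[:n]

def specific_match(candidate, reference, n):
    """Fraction of the candidate's top n that are also in the reference's top n."""
    return len(set(top_n(candidate, n)) & set(top_n(reference, n))) / n

# Hypothetical profiles, for illustration only.
candidate = {"a": 90, "b": 80, "c": 70, "d": 10, "e": 5}
reference = {"a": 500, "c": 400, "e": 300, "b": 20, "d": 1}
print(specific_match(candidate, reference, 2))  # 0.5: only "a" is in both top-2 sets
```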



The second method, frequency matching, is to take the top n entries of the candidate
profile and look up their frequencies in the reference profile, and then compare the
total to the total of the top n entries of the reference profile. For example, taking the
profiles in Figure 1 and again assuming n = 8, the total of the candidate's top 8 entries
as revealed by the reference profile is 553250, while the total of the reference profile's
top 8 entries is 1513681. By this measure, then, the candidate profile gets a score of
553250/1513681, or 0.365. Note that specific matching is symmetric (we get the same
score comparing A to B as comparing B to A), but frequency matching is asymmetric.
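
Frequency matching can be sketched the same way, again assuming a profile is a dictionary from entity name to count and using made-up data rather than the data of Figure 1. Entities missing from the reference profile are counted as zero, which is an assumption of this sketch.

```python
def top_n(profile, n):
    """Return the n entries of a profile with the highest counts."""
    return sorted(profile, key=profile.get, reverse=True)[:n]

def frequency_match(candidate, reference, n):
    """Reference frequency captured by the candidate's top n, relative to the
    frequency captured by the reference's own top n."""
    achieved = sum(reference.get(e, 0) for e in top_n(candidate, n))
    best = sum(reference[e] for e in top_n(reference, n))
    return achieved / best

# Hypothetical profiles, for illustration only.
candidate = {"a": 90, "b": 80, "c": 70, "d": 10, "e": 5}
reference = {"a": 500, "c": 400, "e": 300, "b": 20, "d": 1}
forward = frequency_match(candidate, reference, 2)   # (500 + 20) / (500 + 400)
backward = frequency_match(reference, candidate, 2)  # (90 + 70) / (90 + 80)
```

Swapping the two arguments changes the score here, which illustrates the asymmetry noted above.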

Applying this approach to all four kinds of profiles, for different values of n, 
should give us some notion of how well one profile might predict another. To apply 
this understanding more specifically, we also did some rough computations of the sta- 
bility of the profiles when applied in two specific ways. One application is the promo- 
tion of global variables to registers. The other is intensive optimization of the most fre- 
quently called procedures. 

We should note one important limitation of this approach. It does not address the 
stability of a profile over successive versions of the same program undergoing develop- 
ment. One would expect that some kinds of profiles, such as global variable use or pro- 
cedure invocation, might be relatively stable even when the program is modified. One 
might argue that a program under development will not be run enough times to merit 
profile-based optimization, but it would still be interesting to know whether it would be 
feasible. A thorough study of this question may be in order, but is not considered here. 

3. Programs and data used 

Our test suite consists of eleven programs. Two of them, a text editor and a draw- 
ing editor, are interactive. Two are CAD tools used at WRL. Two are different C com- 
piler front ends; one is recursive descent, the other yacc-based. Three of them are 
SPEC benchmarks. Figure 2 describes the complete test suite. 



program          description
bisim            multi-level machine simulator
bitv             timing verifier
udraw            drawing editor
egrep            file searcher
sed              stream editor
Gosling emacs    text editor
yacc             parser generator
ccom             Titan C front end
gcc1             gnu C front end
eqntott          truth table generator
espresso         set operation benchmark

Figure 2. The eleven test programs.

Wherever possible, we gave the programs quite different input data, in the hopes
of maximizing the differences in their behavior. We ran bisim three different ways:
completely high-level simulation, high-level functional units with a transistor-level
register file, and transistor-level functional units with a high-level register file. Bitv
was run to verify a datapath, a register file, and a write buffer. The drawing editor was
used to draw schematics and also a home landscape design. Egrep and sed were run
with both simple and complicated patterns, and with large and small inputs. Emacs was
used to edit source files, English text files, and very long simulation configuration files.
Yacc was used with a high-level language grammar, an intermediate language grammar,
and a command grammar for a window manager. The two C compilers were both run
with two source files written by humans and two source files generated by the C++
front end. The eqntott and espresso benchmarks from SPEC were run with inputs
provided by SPEC.

Figure 3. Average specific matching score.



4. Results 

4.1. Specific matching 

Our first result assumes that we score candidate profiles by the specific matching 
criterion, for n = 1, 2, 4, 8, 16, 32, 64, and 128. Given a test program and a value of 
n, we proceeded as follows. An estimated profile was scored against each real profile 
for the same test program; we then averaged these scores. Each real profile was scored 
against each of the other real profiles, but not against itself; we then averaged all the 
scores comparing two real profiles. For each test program and each value of n, this 
gave us 20 scores: the cross product of four profile classes and five estimate classes 
(more precisely, four estimate classes and also real profiles from other runs). We then 
averaged these scores over all programs; this double averaging gave each program equal 
weight even though some had more datasets than others. 
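
The double averaging can be sketched as follows. The representation is an assumption: each program is a pair of one estimated profile and a list of real profiles, one per run, and each profile is a dictionary from entity name to count. Real profiles are compared over ordered pairs of distinct runs, which is one reading of the procedure described above; for a symmetric score the choice between ordered and unordered pairs does not change the average.

```python
from itertools import permutations
from statistics import mean

def specific_match(candidate, reference, n):
    """Fraction of the candidate's top n that are also in the reference's top n."""
    def top(profile):
        return set(sorted(profile, key=profile.get, reverse=True)[:n])
    return len(top(candidate) & top(reference)) / n

def program_scores(estimate, runs, n, score=specific_match):
    """Return (estimated-profile score, real-profile score) for one program."""
    # The estimate is scored against every real profile of the program.
    est = mean(score(estimate, ref, n) for ref in runs)
    # Each real profile is scored against every other one, never against itself.
    real = mean(score(a, b, n) for a, b in permutations(runs, 2))
    return est, real

def suite_scores(programs, n, score=specific_match):
    """Average the per-program averages, giving each program equal weight."""
    per_program = [program_scores(est, runs, n, score) for est, runs in programs]
    return mean(s[0] for s in per_program), mean(s[1] for s in per_program)
```

Averaging per program first means a program with many data sets does not dominate the suite-wide score.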

The results are shown in Figure 3. The fraction of the circle filled with black is
the score, so a completely black circle is perfect and a completely white circle is
terrible. We can see that predicting which globals will be used is fairly easy, probably
because there are fewer of them than there are of the other profiled entities. The
call-loop estimates do rather better than the other estimates. As we would expect, actual
profiles do considerably better than estimates, but even actual profiles do
disappointingly badly at predicting which basic blocks will be executed most.

Figure 4. Average frequency matching score.



4.2. Frequency matching 

Our next result has the same structure as the previous result, but it assumes fre- 
quency matching instead of specific matching. Again, we used n = 1, 2, 4, 8, 16, 32, 
64, and 128. Each profile's scores were again averaged over all the profiles it was com- 
pared against, and the resulting averages were again averaged over the eleven test pro- 
grams. 

The results are shown in Figure 4. We were rather more successful at frequency
matching than at specific matching.* The trends, however, are much the same: globals
are easy to predict, blocks are hard, call-loop estimates work better than the others, and
actual profiles work best of all.

* This is not guaranteed in general: the candidate profile in Figure 1, for example, got a
better score at specific matching.

4.3. Differences between test programs

There is a substantial variation in the predictability of the different programs. Figure 5
shows the average score for real (not estimated) profiles, using the frequency
matching criterion. These are the fifth and tenth rows of Figure 4, broken down by
program. Emacs is astonishingly predictable, perhaps because it is built around a Lisp
interpreter, so that much of its control logic (and thus much of its variability) is hidden
in the data structure. This argument would lead us to suppose that gcc1, with a
table-driven parser, might be more predictable than ccom, with a recursive descent parser.
But in fact ccom is noticeably more predictable than gcc1. The least predictable
programs are sed and eqntott, which is a little surprising because they are among the
smallest.

Figure 5. Average frequency matching scores for real profiles.

4.4. Global register allocation

To apply this technique to a realistic specific example, let us suppose that we
suddenly have eight registers available that we can use to promote eight global variables or
constants. The payoff of doing this is that all the loads and stores of the globals we
select will be removed. We can estimate our improvement in performance by counting
the executions of these loads and stores and dividing the total by the total number of
instructions executed.* We did this both for a reference profile (to see how well we
could possibly have done) and for a candidate profile, in each case computing the
counts using the reference profile.

Figure 6. Improvement from global register allocation.

The results are shown in Figure 6. This optimization by itself doesn't do a lot for 
performance: even if magically driven by the counts from the reference profile, the 
improvement in performance is only 2.7%. A good estimated profile gives us about 
half of the maximum possible performance improvement, and an actual profile gives us 
about 85% of the maximum. 
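
The rough estimate used in this section can be sketched as follows, assuming a global variable profile is a dictionary from variable name to the number of executed loads and stores of that variable. The function name and the sample numbers are illustrative only and are not the study's data.

```python
def promotion_gain(candidate, reference, total_instructions, registers=8):
    """Fraction of executed instructions removed by promoting the candidate
    profile's top globals, counted with the reference profile."""
    chosen = sorted(candidate, key=candidate.get, reverse=True)[:registers]
    removed = sum(reference.get(g, 0) for g in chosen)
    return removed / total_instructions

# Hypothetical data: four globals, two registers, 10000 executed instructions.
reference = {"g1": 300, "g2": 200, "g3": 100, "g4": 50}
candidate = {"g1": 5, "g4": 4, "g2": 1, "g3": 0}
ideal = promotion_gain(reference, reference, 10000, registers=2)   # 500 / 10000
actual = promotion_gain(candidate, reference, 10000, registers=2)  # 350 / 10000
```

Passing the reference profile as its own candidate gives the maximum possible improvement, and the ratio of `actual` to `ideal` is the fraction of that maximum the candidate achieves.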

4.5. Selective intensive optimization 

As a second specific example, let us suppose we have an excellent but expensive 
optimization algorithm that will cut the execution time of any procedure in half, but 
that is so expensive that we can apply it only to 5% of our procedures. We will select 
as the procedures to optimize those we believe will be invoked most often, by picking 
the first 5% of the entries in the procedure invocation profile. As before, we will do 
this both for a candidate profile and also for a reference profile; we will compute the 
improvement in performance using only the counts from the reference profile. 

The results are shown in Figure 7. This optimization would speed up our pro- 
grams by a third if it were driven by a perfect profile. A real profile gives us about 
three-fourths of that, but even the best estimated profile — which oddly enough was the 
simple loop-only estimate — gives us barely one-fourth. 
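
A sketch of this computation follows. It rests on assumptions that the text does not spell out: `reference_time` is a hypothetical table of the execution time (or instruction count) spent in each procedure during the reference run, the 5% cutoff is rounded up to a whole number of procedures, and the result is expressed as the fraction of total execution time removed.

```python
def selective_gain(candidate_calls, reference_time, percent=5):
    """Fraction of reference execution time removed by halving the time of the
    procedures that the candidate profile ranks as most often invoked.

    candidate_calls: procedure -> invocation count in the candidate profile
    reference_time:  procedure -> time spent in it in the reference run
                     (hypothetical input, see the assumptions above)
    percent:         share of procedures we can afford to optimize
    """
    # Integer ceiling of percent% of the procedures, and at least one.
    k = max(1, (len(candidate_calls) * percent + 99) // 100)
    chosen = sorted(candidate_calls, key=candidate_calls.get, reverse=True)[:k]
    saved = 0.5 * sum(reference_time.get(p, 0) for p in chosen)
    return saved / sum(reference_time.values())
```

To measure the best possible result, the reference run's own invocation counts would be passed as `candidate_calls`.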

* This does not take pipeline stalls into account, nor does it consider cache effects, which are likely to 
increase the benefit of promoting globals to registers. It also assumes that the globals selected are not 
ineligible because of aliasing. We are interested only in rough numbers here, as an example. 

Figure 7. Improvement from selective intensive optimization.



5. Conclusions 

Real profiles from different runs worked much better than the estimated profiles 
discussed in this paper. The best estimations were usually those that combined loop 
nesting level with static call counts. Basing the estimate on the procedure's distance 
from leaves of the call graph was less effective. There may of course still be better 
ways to estimate a profile: this is an interesting open question both in the general case 
and in specific applications. 

Even a real profile was never as good as a perfect profile from the same run being 
measured. It was often quite close, however, and was usually at least half as good. 
Profile-based optimization would seem to have a future, but we must be careful how we 
measure it, lest we expect more than it can really deliver. 